1.  SUMMARY


This  final  report  contains  the  results  of  our  work  on  DSPACE:  Distributed  Sensing  and 
Processing  Adaptive  Collaborative  Environment.  During  this  1-year  effort,  we  developed  a 
formal  process,  algorithms,  and  software  (SW)  prototype  to  enable  processing  of  distributed 
relational  data  across  multiple  autonomous  heterogeneous  computing  resources  in  environments  with 
limited  control,  resource  failures,  and  communication  bottlenecks.  This  work  was  accomplished 
through  five  tasks. 

•  Task  1  developed  a  formal  process  to  perform  data  querying  via  inexact  graph  matching 
using  the  belief  propagation  algorithm. 

•  Task  2  developed  a  distributed  collaborative  querying  model  using  the  belief  propagation 
algorithm.  This  includes  a  formal  representation  of  the  collaborative  data  analysis 
problem,  a  set  of  methodologies  for  query  partitioning  and  task  assignment,  the 
communication  message  structure  and  the  collaborative  updates  required  for  the 
autonomous  agents  to  perform  their  tasks. 

•  Task 3 developed a SW prototype of the distributed graph matching framework using the
Java Message Service (JMS) API and the Bulk Synchronous Parallel (BSP) processing
paradigm.

•  Task 4 developed a set of algorithms to improve the solution’s scalability and reliability.
This includes adaptive information prioritization-based filtering, communication
compression algorithms, and strategies for detection of and recovery from agent failures.

•  Task 5 conducted validation experiments using randomly generated graph data as well as
open source research data to assess the sensitivity of the system’s search accuracy to the size
of and noise in the data, and to showcase the benefits and trade-offs of the distributed search
model under the various scalability strategies.

The  aforementioned  tasks  are  described  in  detail  in  the  following  sections. 
2.  INTRODUCTION


2.1  Motivation  and  Summary  of  Accomplishments 


The amount of data that needs to be collected, examined, and shared during Intelligence,
Surveillance, and Reconnaissance (ISR) operations is growing fast due to increasing sensor use.
In order to manage the resulting data deluge, the U.S. intelligence community is moving away
from manual data analysis toward automated processing capabilities, with a focus on two
technologies: Cloud-based distributed processing and autonomous cooperative data exploitation
(Figure 1).


[Figure 1 panels: (a) Data is collected by distributed sensors; (b) Traditional manual data analysis creates stove-pipes and analyst overload; (c) Cloud solutions require data to be moved to a centralized location; (d) Collaborative solutions achieve faster analytics and avoid security problems by preserving data locality]

Figure 1: Analysis of data collected by multiple heterogeneous sources (a) is moving away from manual stove-piped
processing (b) to Cloud-based (c) or autonomous collaborative solutions (d)

DSPACE focuses on the problem of data exploitation in denied areas, where the control of
analytical operations and coordination between distributed data access and computing resources,
and even the existence of these resources, can be disrupted. Cloud-based technologies are
inappropriate for this domain due to weak control over computational units and the inability
to move the data to a central warehouse where it could be indexed, partitioned, and analyzed in
parallel in a fully controlled environment. Our model provides a solution for autonomous
cooperative exploitation of relational data, which is encountered in a variety of applications
ranging from geospatial analysis to open source mining.

During this one-year effort, we developed a model for processing distributed data across multiple
heterogeneous computing resources. Our model exploits the dependencies in the data to provide
solutions to both distributed querying and pattern learning. In distributed querying mode, the
computing resources are assigned subsets of the query based on high-level information about the
data they have access to. This query decomposition is designed to achieve the highest quality of
search and an optimal balance of computational load between available resources. The resources
find local query match estimates and collaborate by exchanging belief messages that efficiently
encode how the agents can influence each other’s estimates.

In distributed pattern learning mode, the computational resources learn partial correlations
between the data they have access to, transfer this knowledge to other agents, and collaboratively
learn the patterns within and between their local data subsets.

Our model achieves better-than-linear computational complexity by using several concepts from
probabilistic data analysis and belief propagation. First, we minimize the subset of the data being
processed at any given time by prioritizing the nodes in the data graph and processing only the
subset of highest-priority nodes. This ensures that we perform the least amount of irrelevant
analytical computation. Next, we compress the communication messages by residual (the change
from the previous value) or absolute value of beliefs, thus minimizing communication between
distributed resources. Finally, the computational resources use local data prioritization to
incrementally process subsets of data and reduce the computation time of every analysis
superstep, resulting in faster production of near-optimal results.

2.2  Background  and  Functional  Requirements 

Leading distributed query processing models for sensor networks (Madden et al., 2003; Yao and
Gehrke, 2003) try to acquire as much data as possible from the environment, even though most of
that data provides little improvement to approximate answer quality. Hence these models incur
execution costs, in both time and resource utilization, that are orders of magnitude higher
than is appropriate for a reasonably reliable answer (Deshpande et al., 2005). While successfully
transitioned to distributed cloud systems, these models would not be appropriate for a
collaborative analysis domain that has constraints on computation resources and communication
bandwidth.

Recently, distributed sensing and data processing has received significant attention in both
research and development (Bryant et al., 2008; Dahm, 2010) and acquisition programs. Most
existing technologies were developed for raw data processing (e.g., detection of objects in
imagery based on networks of cameras; Ding et al., 2012), sensor placement (Mathew and
Surana, 2012), or coordinated planning and scheduling of homogeneous agents (Chen, Levy, and
Decker, 2007). These solutions are inadequate for the general collaborative data analysis
problem, since it often involves diverse datasets and heterogeneous computation resources, and may
contain overlapping or complementary data subsets.

In order to address these challenges and satisfy the functional requirements of this Broad Agency
Announcement (BAA) (see Table 1), Aptima developed DSPACE, a system for distributed
sensing and processing within an adaptive collaborative environment. Our collaborative
distributed data analysis solution leverages the belief propagation algorithm to perform graph
mining operations, such as finding inexact matches to queries or model patterns, or learning
frequent consistent patterns in the data (Levchuk, Shabarekh, and Furjanic, 2011; Levchuk,
Roberts, and Freeman, 2012; Levchuk et al., 2013). We exploit the collaborative nature of belief
propagation and the probabilistic interpretation of the messages as globally influencing mechanisms
for local analytical computations.
Table 1: Functional requirements for distributed collaborative data mining

| Requirements in BAA | What it means | DSPACE solution (Models/Algorithms) | Benefits |
|---|---|---|---|
| “Agents mine knowledge from their own collected data” | Efficient local data search and inference @ agents | Prioritization-based Filtering: local data prioritized / filtered / iteratively processed | Reduce time of data analysis |
| “Agents represent their findings in a way that can be comprehended by interested parties” | Generalizable situation representation & collaboration @ agents | Belief Propagation Model: local inference / collaborative influence (belief) messages | Enable knowledge transfer |
| “Agents... identify what information should be propagated and to whom (both individual and shared objectives)” | Agents send only required/influencing information; do not send their experiences | Belief Propagation Model: defines what messages are communicated and to whom | Obtain optimal solutions in distributed manner and minimize communication needs |
| “...agents to complete mission objectives in dynamic and contested environments where ... lines of communication can change over time” | Efficient communication policy and adaptive belief update @ agent | Communication Compression, Asynchronous Collaboration, & Agent Failure Recovery: adaptively compress sent messages, belief updates recovery using delay information, agent state monitoring | Achieve operation robust to communication constraints |

The idea of using belief propagation as a model for collaborative data analysis is not completely
new. Previous research in this area developed the basic building blocks for such a system (Crick
and Pfeffer, 2003; Pfeffer and Tai, 2012; Anker, Dolev, and Hodd, 2008; Chechetka and
Guestrin, 2010; Rogers et al., 2011). Our solution differs from previous work in three ways.
First, we use graph matching as the process for collaborative querying, as opposed to the full data
graph labeling employed in other models. This enables efficient scalability (our algorithms are
linear in the size of the data) while maintaining high retrieval accuracy when data is noisy and
queries are ambiguous. By contrast, even the most successful heuristic solutions for inference on
graphs usually require a number of computations that grows as a higher-degree polynomial and are
capable of only exact pattern search (Aggarwal, Khan, & Yan, 2011; Brocheler, Pugliese, &
Subrahmanian, 2010; Khan et al., 2011; Rohloff & Schantz, 2010, 2011). Second, we developed
several improvements to collaborative processing. Our derivation of belief updates yields a
compact formulation that couples efficiently with communication compression and
data prioritization operations. We show that communication compression using global belief
values works as well as or better than the heuristics using belief residuals (Elidan, McGraw, and
Koller, 2006; Sutton and McCallum, 2007; Gonzalez, Low, and Guestrin, 2009), and we developed
exact updates for asynchronous message passing to guarantee optimality under communication
delays. Our information prioritization model is equivalent to adaptive partitioning of the data
graph, but has a different objective of iterative data processing instead of parallel execution, and
thus differs from static graph segmentation and indexing solutions (Brocheler et al., 2010;
Malewicz et al., 2010) as well as dynamic partitioning methods (Yang et al., 2012). Finally,
although not a part of the above requirements, we developed a unique model for distributed
collaborative pattern learning, where the computational resources learn fragments of the patterns
in their individual data subsets and collaboratively assemble those patterns into globally coherent
structures.

2.3  Research  Objectives 

The DSPACE system aims to achieve the following research objectives:

•  Objective 1: formalize a process to perform data querying via inexact graph matching
using the belief propagation algorithm.

•  Objective  2:  develop  a  distributed  collaborative  querying  model  using  the  belief 
propagation  algorithm. 

•  Objective  3:  develop  algorithms  to  define  optimized  agent  task  assignment  and 
collaboration  policy  and  establish  a  collaboration  process  using  a  standard  peer-to-peer 
communication  framework. 

•  Objective 4: develop strategies and algorithms to improve scalability and reliability of the
distributed collaborative process.

•  Objective  5:  develop  a  distributed  collaborative  pattern  learning  model. 

•  Objective  6:  conduct  experimental  validation  of  accuracy,  scalability  and  adaptability  of 
the  solution. 
3.  METHODS, ASSUMPTIONS, AND PROCEDURES


3.1  Querying  Using  Inexact  Graph  Matching 

Most web search engines let users query data using only a set of keywords.
Keywords are both a simple way to specify information needs and an input structure that
requires minimal user interfaces and places little to no burden on the users. Recently, answering
keyword queries on graph-structured data has emerged as an important research topic (Bhalotia
et al., 2002; He et al., 2007). The best search engines have been using graph knowledge stores
behind the scenes to disambiguate the queries, improve retrieval relevancy, and categorize
the search results (Singhal, 2012), and researchers have been developing approaches to
convert keyword queries into query graphs (Tran et al., 2009). This change of search
technologies is rational, because people communicate using stories and not keywords. While
construction of graphical queries might be challenging for general users, expert users can
define complex queries with little guidance, and the solutions to support their needs have
advanced from the SQL query language supporting general database querying to the SPARQL* query
language supporting queries over RDF triple stores. Intelligence analysts in particular need
to convert their information requirements into multi-attributed query graphs for
both manual analysis task distribution and automated search over relational data (Levchuk and
Pattipati, 2013). In the following sections we describe our model for distributed collaborative
data search, starting with the definition of the centralized querying problem and then describing the
algorithms that distribute the query among multiple heterogeneous computational resources.

3.1.1  Formal  definition  of  data  querying  as  inexact  matching  problem 

Formally, a complex query, also referred to as a model, is defined as an attributed graph
$G^M = (V^M, E^M, A^M)$, where $V^M = \{1, \ldots, M\}$ is a set of vertices (representing entities or entity
requirements), $E^M = \{(k, m) : k, m \in V^M\}$ is a set of edges (representing relations between
entities), and $A^M = \{a^M_{km}\}$ are attributes describing how the entities and their relations may be
observed, i.e. $a^M_{kk}$ are attributes (e.g., semantic and syntactic descriptors) of entity $k$, and $a^M_{km}$ are
attributes of the relation $(k, m)$ between entities $k$ and $m$. The graph $G^M$ encodes the knowledge
we want to find in the data. Similarly, the knowledge aggregated from multiple sources is
defined using an attributed data graph $G^D = (V^D, E^D, A^D)$, where $V^D = \{1, \ldots, N\}$ ($N \gg M$) are
entity instances or mentions, $E^D$ are observed relations between these entities, and $A^D = \{a^D_{ij}\}$
define the actually observed (extracted from text) attributes of entities $a^D_{ii}$ and relations $a^D_{ij}$. The
mapping from the query (model) graph to the data graph is defined as a 0-1 node-to-node
assignment matrix $S = \|s_{ki}\|$, where $s_{ki} = 1$ if query node $k$ is mapped to entity
mention $i$, and 0 otherwise (Figure 2).

We can interpret the query graph as consisting of the questions (model graph’s nodes) with
supporting details. The mapping is scored using a conditional posterior probability value
(Levchuk, Roberts, and Freeman, 2012; Levchuk and Pattipati, 2013):

$$P(S \mid D, M) = e^{-Q(S)/\eta(G^M)} \tag{1}$$


*  http://www.w3.org/TR/rdf-sparql-query/
Here, $Q(S)$ is a quadratic function of the mismatch between the model graph and the mapping-
induced subgraph in the data, and the normalization coefficient $\eta(G^M)$ corresponds to the norm
of the model:

•  Mismatch: $$Q(S) = \underbrace{\sum_{k,i} C_{ki}\, s_{ki}}_{\text{node mismatch}} + \underbrace{\sum_{k,m,i,j} C_{kmij}\, s_{ki}\, s_{mj}}_{\text{link mismatch}} \tag{2}$$

•  Model norm: $$\eta(G^M) = \underbrace{\sum_{k} C_{k,\mathrm{null}}}_{\text{node norm}} + \underbrace{\sum_{k,m} C_{km,\mathrm{null}}}_{\text{link norm}}$$

Here, the linear and quadratic components correspond to the mismatches between requested (model)
attributes and observed (data) attributes of the nodes $k \in V^M$ and $i \in V^D$, i.e. $C_{ki} \sim \mathrm{mismatch}(a^M_{kk}, a^D_{ii})$,
and of the links $(k, m) \in E^M$ and $(i, j) \in E^D$, i.e. $C_{kmij} \sim \mathrm{mismatch}(a^M_{km}, a^D_{ij})$, while $C_{k,\mathrm{null}}$ and
$C_{km,\mathrm{null}}$ correspond to the penalties on query nodes and links, e.g. the maximum mismatch or a
mismatch to a node/link with no attributes.
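As an illustration of Eq. (2), the following minimal Python sketch evaluates the quadratic mismatch for a given 0-1 assignment matrix. The function name `quadratic_mismatch`, the 0-based indices, and the dense cost layout (a list of rows for node costs, one N-by-N table per model link) are assumptions made for this example and are not taken from the DSPACE prototype.

```python
def quadratic_mismatch(S, c_node, c_link, model_edges):
    """Evaluate Q(S) from Eq. (2) for a 0-1 assignment matrix S (M rows, N columns).

    c_node[k][i]       : node mismatch C_ki (assumed layout)
    c_link[(k,m)][i][j]: link mismatch C_kmij for model link (k, m) (assumed layout)
    """
    M, N = len(S), len(S[0])
    # Linear term: node mismatch of every selected (query node, data node) pair.
    node_term = sum(c_node[k][i] * S[k][i] for k in range(M) for i in range(N))
    # Quadratic term: link mismatch for every model link and selected data pair.
    link_term = sum(
        c_link[(k, m)][i][j] * S[k][i] * S[m][j]
        for (k, m) in model_edges
        for i in range(N)
        for j in range(N)
    )
    return node_term + link_term


# Toy example: two query nodes, three data nodes, one model link (0, 1).
c_node = [[0, 1, 2], [2, 0, 1]]
c_link = {(0, 1): [[5, 0, 5], [5, 5, 5], [5, 5, 5]]}  # only data pair (0, 1) matches
good = [[1, 0, 0], [0, 1, 0]]  # query 0 -> data 0, query 1 -> data 1
bad = [[0, 1, 0], [0, 0, 1]]   # query 0 -> data 1, query 1 -> data 2
print(quadratic_mismatch(good, c_node, c_link, [(0, 1)]))
print(quadratic_mismatch(bad, c_node, c_link, [(0, 1)]))
```

In this toy setting the first mapping has zero mismatch, while the second accumulates both node and link penalties.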


[Figure 2 panels: (a) Querying is equivalent to finding multiple subgraphs in data that closely match the query graph; (b) A single query match between the query (model graph) and the observations (data graph); (c) A mapping output, $X = X(S) = [4, 5, 6]$, with its induced data subgraph]

Figure 2: Illustration of graph matching variables

Responses to the query $G^M$ are represented as a tuple:

$$\{X, G^{D(X)}\} \tag{3}$$

where $X = X(S) = [x_1, \ldots, x_{|V^M|}]$ is a vector of the data variables corresponding to the queries
and computed from the mapping matrix $S$, i.e. $X = [x_k \in V^D, k \in V^M : s_{k x_k} = 1]$; and
$G^{D(X)} = (V^{D(X)}, E^{D(X)}, A^{D(X)})$ is the subgraph of the data graph induced by $S$, i.e. $V^{D(X)} = \{x_1, \ldots, x_{|V^M|}\}$
and $E^{D(X)} = \{(x_k, x_m) \in E^D : k, m \in V^M\}$. We desire to retrieve responses $\{X, G^{D(X)}\}$ which
maximize the posterior probability from Eq. (1), or equivalently minimize the quadratic mismatch
function $Q(S)$ in Eq. (2); in other words, we wish to retrieve the subgraphs of the data that match
the query as closely as possible.
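The relation between the assignment matrix, the mapping vector, and the induced subgraph can be sketched in a few lines of Python. The helper names `mapping_vector` and `induced_edges` and the 0-based node indices are assumptions of this example.

```python
def mapping_vector(S):
    """Return X = [x_k], where x_k is the data node with s_{k, x_k} = 1."""
    return [row.index(1) for row in S]


def induced_edges(X, data_edges):
    """Return the data links whose endpoints are both images of query nodes."""
    return {
        (X[k], X[m])
        for k in range(len(X))
        for m in range(len(X))
        if (X[k], X[m]) in data_edges
    }


# Toy example: three query nodes mapped into a six-node data graph.
S = [
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]
data_edges = {(0, 1), (3, 4), (4, 5)}
X = mapping_vector(S)
E_X = induced_edges(X, data_edges)
```

Here `X` and `E_X` together play the role of the response tuple: the mapped data nodes and the data links among them.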

3.1.2  Solving  inexact  matching  using  belief  propagation 

Due to the factoring of the objective function in Eq. (1), its maximization can be achieved using
a Smoothed Loopy Belief Propagation (SLBP) algorithm (Levchuk, Roberts, and Freeman, 2012;
Levchuk and Pattipati, 2013). We first incrementally update the estimates of marginal
probabilities:

$$b_k(i) = P(x_k = i \mid D, M), \tag{4}$$

for which the joint posterior probability in Eq. (1) is maximized. The marginal probability vector
$b_k = [b_k(i), i \in V^D]$ represents a belief about the location of query node $k \in V^M$ in the observed
dataset.

The original SLBP algorithm obtained the solution to Eq. (1) by passing belief messages in a factor
graph (Levchuk and Pattipati, 2013). The factor graph is constructed based on the factorization of
the objective function: it includes variable nodes and factor nodes (Figure 3). Variable nodes
correspond to query nodes $k \in V^M$; these nodes maintain and update messages $\mu_k = [\mu_k(i), i \in V^D]$
corresponding to the logarithm of marginal beliefs ($\mu_k(i) = \log b_k(i)$), and send these
messages to factor nodes. The factor nodes are defined for each link $(k, m) \in E^M$ in the query,
corresponding to the relationship between query entities; these nodes maintain and update two
factor messages: forward and backward. Forward messages represent the marginal log-
probabilities of matching model link $(m, k)$ to the data link that ends in node $j$, $f_{(m,k)} = [f_{(m,k)}(j), j \in V^D]$.
Forward messages represent the amount of influence that the model node
$m \in V^M$ exerts on the decision to map model node $k \in V^M$ to data node $j \in V^D$. Backward
messages represent the marginal log-probabilities of matching model link $(m, k)$ to the data link
that starts in node $j$, $r_{(m,k)} = [r_{(m,k)}(j), j \in V^D]$. Backward messages represent the amount of
influence that the model node $k \in V^M$ exerts on the decision to map model node $m \in V^M$ to data
node $j \in V^D$. Essentially, the forward and backward messages can be interpreted as information
influencing the search for one query entity based on the other entities that are linked to it.


[Figure 3 panels: (a) Query (model graph); (b) Factor graph with variable nodes and factor nodes; (c) Message passing in BP, with variable messages and forward/backward factor messages; (d) Belief/message update iterations: 1. Receive messages, 2. Compute beliefs, 3. Generate & send messages]

Figure 3: Message passing policy is derived from the structure of the query graph. For the example of query
(a), the factor graph (b) contains five variable nodes and four factor nodes. During message passing, variable
nodes send their messages to factor nodes, and vice versa. In the original formulation, the messages are passed
between factor and variable nodes (d)


Formally, the beliefs/messages are updated iteratively in three steps (Figure 3d) using the
following equations:

•  Updates at variable nodes (local beliefs):

$$\mu_m(i) = -C_{mi} + \sum_{l:(l,m) \in E^M} f_{(l,m)}(i) + \sum_{l:(m,l) \in E^M} r_{(m,l)}(i) \tag{5}$$

•  Updates at factor nodes (message generation):

$$f_{(m,k)}(j) = \max_{i:(i,j) \in E^D} \left( -C_{mkij} + \mu_m(i) - r_{(m,k)}(i) \right) \tag{6}$$

$$r_{(m,k)}(j) = \max_{i:(j,i) \in E^D} \left( -C_{mkji} + \mu_k(i) - f_{(m,k)}(i) \right) \tag{7}$$

According to Eqs. (5)-(7), in-memory single-machine belief propagation requires a total
number of message updates and memory storage on the order of $O(\max\{|V^M|, |E^M|\} \times
\max\{|V^D|, |E^D|\})$ operations/variables per iteration, while the number of iterations to
convergence is on the order of the length of the longest path in the model network. Thus, since the size
of the query graph is usually small, the belief propagation algorithm has worst-case linear
complexity in the size of the data graph.
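The three update steps can be sketched as one synchronous pass over log-domain messages. This is a minimal illustration under stated assumptions, not the prototype's implementation: the function name `bp_step`, the dense cost tables, the 0-based indices, and the finite `NO_LINK` value used when no data link supports a message are all hypothetical choices made for this example.

```python
NO_LINK = -10.0  # assumed finite log-penalty when no data link supports a message


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def bp_step(mu, f, r, c_node, c_link, model_edges, data_edges):
    """One synchronous pass of the variable and factor updates, Eqs. (5)-(7)."""
    n = len(c_node[0])
    new_f, new_r = {}, {}
    for (m, k) in model_edges:
        cost = c_link[(m, k)]
        # Forward message, Eq. (6): best data link (i, j) that ends in j.
        new_f[(m, k)] = [
            max((-cost[i][j] + mu[m][i] - r[(m, k)][i]
                 for i in range(n) if (i, j) in data_edges), default=NO_LINK)
            for j in range(n)
        ]
        # Backward message, Eq. (7): best data link (j, i) that starts in j.
        new_r[(m, k)] = [
            max((-cost[j][i] + mu[k][i] - f[(m, k)][i]
                 for i in range(n) if (j, i) in data_edges), default=NO_LINK)
            for j in range(n)
        ]
    # Variable update, Eq. (5): node cost plus incoming forward and backward messages.
    new_mu = []
    for m in range(len(c_node)):
        row = []
        for i in range(n):
            total = -c_node[m][i]
            total += sum(new_f[e][i] for e in model_edges if e[1] == m)
            total += sum(new_r[e][i] for e in model_edges if e[0] == m)
            row.append(total)
        new_mu.append(row)
    return new_mu, new_f, new_r


# Toy example: query link 0 -> 1, data graph with three nodes and one link 0 -> 1.
model_edges = [(0, 1)]
data_edges = {(0, 1)}
c_node = [[0.0] * 3 for _ in range(2)]
c_link = {(0, 1): [[0.0] * 3 for _ in range(3)]}
mu = [[0.0] * 3 for _ in range(2)]
f = {(0, 1): [0.0] * 3}
r = {(0, 1): [0.0] * 3}
for _ in range(3):
    mu, f, r = bp_step(mu, f, r, c_node, c_link, model_edges, data_edges)
best = [argmax(row) for row in mu]
```

In this toy case the only data link supports mapping query node 0 to data node 0 and query node 1 to data node 1, which is what the message maxima in `best` select. Each pass touches every model link once per data link, in line with the per-iteration complexity stated above.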

In our model, we perform the “smoothing” of the belief updates using reinforcement learning,
where the messages $\mu_m^t(i)$ computed via Eq. (5) at time $t$ are used to incrementally update the
smoothed belief estimates $\hat{\mu}_m^t$ based on the estimates calculated at time $t - 1$:

$$\hat{\mu}_m^t = (1 - \alpha)\, \hat{\mu}_m^{t-1} + \alpha\, \mu_m^t \tag{8}$$

Eq. (8) attempts to avoid the errors introduced by cycles in the factor graph, which otherwise
would create oscillations in the beliefs $\mu_m(i)$, and also provides an effective mechanism for
accounting for message passing delays. This results in using a weighted history of estimates:

$$\hat{\mu}_m^t = (1 - \alpha)^t\, \hat{\mu}_m^0 + \alpha \sum_{\tau=1}^{t} (1 - \alpha)^{t-\tau}\, \mu_m^\tau \tag{9}$$

When the algorithm terminates at iteration $T$, we compute the marginal probabilities in Eq. (4)
from beliefs:

$$b_k(i) \propto e^{\hat{\mu}_k^T(i)} \tag{10}$$
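The smoothing recursion and its unrolled weighted-history form can be checked numerically with a short sketch. The function names and the scalar (single message entry) setting are assumptions of this example; the final helper normalizes log-domain beliefs into probabilities, subtracting the maximum before exponentiating for numerical stability.

```python
import math


def smooth_recursive(raw, alpha, init=0.0):
    """Apply the incremental smoothing update to a sequence of raw messages."""
    estimate = init
    for value in raw:
        estimate = (1 - alpha) * estimate + alpha * value
    return estimate


def smooth_closed_form(raw, alpha, init=0.0):
    """Unrolled form: geometric weights over the history of raw messages."""
    t = len(raw)
    history = sum((1 - alpha) ** (t - tau) * raw[tau - 1] for tau in range(1, t + 1))
    return (1 - alpha) ** t * init + alpha * history


def beliefs_from_log(mu_row):
    """Normalize log-domain beliefs into a probability vector."""
    top = max(mu_row)
    weights = [math.exp(v - top) for v in mu_row]
    total = sum(weights)
    return [w / total for w in weights]


# An oscillating raw message, as can be produced by cycles in the factor graph.
raw = [1.0, -1.0, 1.0, -1.0, 1.0]
rec = smooth_recursive(raw, alpha=0.3)
closed = smooth_closed_form(raw, alpha=0.3)
b = beliefs_from_log([0.0, -10.0, -10.0])
```

The two forms agree up to floating-point error, and the smoothed value stays well inside the range of the oscillating input, which is the damping effect described above.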

3.1.3  Generating  query  output 

The query tuples $\{X, G^{D(X)}\}$ are generated from the probability values $[b_k(i)]$, $k \in V^M$, $i \in V^D$,
using several methods, including K-best maximum weight assignment, marginal probability
sampling, conditional sampling, and belief re-evaluation. We describe these methods below.

K-best  maximum  weight  assignment 

We approximate the total probability of matching as the product of marginal probabilities:

$$\Pr(X) \approx \prod_{m \in V^M} b_m(x_m) \tag{11}$$

Then the matching can be found as a linear assignment that maximizes the total reward
$\sum_{m \in V^M} \mu_m(x_m)$. To minimize the complexity of solving the assignment problem, we filter the
dataset by finding the subset of feasible data points:

$$\tilde{V}^D = \{ i \in V^D : \max_k b_k(i) > \text{threshold} \} \tag{12}$$

We then solve a 0-1 assignment problem:

$$\max \sum_{m \in V^M} \sum_{i \in \tilde{V}^D} \mu_m(i)\, s_{mi} \quad \text{s.t.} \quad \sum_{i \in \tilde{V}^D} s_{mi} = 1, \quad s_{mi} \in \{0, 1\} \tag{13}$$
^■'1  S„i£{0,l) 
The set of K ranked “best” solutions to (13) can be obtained using Murty’s algorithm (Murty,
1968), whose computational complexity is polynomial in the size of the filtered data set
(Pascoal, Captivo, and Climaco, 2003). Since the filtered set $\tilde{V}^D$ is usually very small, this
computational complexity is tractable even for large datasets. The resulting mappings $X$ are then
re-ordered using the mismatch cost $Q(X)$.
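The filtering and ranking steps can be illustrated on a toy belief table. The sketch below is a brute-force stand-in, not Murty's algorithm: it enumerates all one-to-one assignments over the filtered data nodes, which is only practical when the filtered set is very small. The function names, the one-to-one restriction, and the strictly positive beliefs are assumptions of this example.

```python
import itertools
import math


def feasible_nodes(b, threshold):
    """Keep data nodes whose best belief over all query nodes exceeds the threshold."""
    n = len(b[0])
    return [i for i in range(n) if max(row[i] for row in b) > threshold]


def k_best_assignments(b, threshold, K):
    """Rank one-to-one assignments by the sum of log-beliefs (brute force)."""
    nodes = feasible_nodes(b, threshold)
    scored = []
    for combo in itertools.permutations(nodes, len(b)):
        score = sum(math.log(b[m][i]) for m, i in enumerate(combo))
        scored.append((score, list(combo)))
    scored.sort(key=lambda pair: -pair[0])
    return [assignment for _, assignment in scored[:K]]


# Beliefs for two query nodes over four data nodes (each row sums to 1).
b = [
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.60, 0.25, 0.05],
]
kept = feasible_nodes(b, threshold=0.1)
top2 = k_best_assignments(b, threshold=0.1, K=2)
```

Data node 3 is filtered out because no query node assigns it a belief above the threshold, and the two highest-scoring assignments are returned in rank order.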

The K-best assignment above may produce solutions that are far from optimal when
there are multiple data subgraphs close to the query graph, or subgraph permutations with the
same mismatch cost (often the case for homogeneous and/or symmetrical data/query). To avoid
this issue, the sampling methods described below can produce an improved set of mappings.

Marginal  probability  sampling 

In this method, we generate the mapping $X = [x_1, x_2, \ldots, x_M]$ using a multinomial distribution,
where for each $k \in V^M$ we generate the value $x_k \in V^D$ by drawing samples with probabilities
$p_i = \Pr(x_k = i) = b_k(i)$. We then compute the cost of the resulting matching $Q(X)$, and select the
set of mappings $X$ that have the smallest score.
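A minimal sketch of this sampling loop is given below. It assumes a caller-supplied `cost` function standing in for the mismatch score and uses Python's `random.choices` for the per-node draws; the function name, the fixed default seed, and the `keep` parameter are hypothetical choices for this example.

```python
import random


def sample_mappings(b, cost, n_samples, keep, rng=None):
    """Draw each x_k independently from its marginal b_k and keep the cheapest mappings."""
    rng = rng or random.Random(0)  # fixed seed only to make the example repeatable
    n = len(b[0])
    drawn = []
    for _ in range(n_samples):
        X = [rng.choices(range(n), weights=row)[0] for row in b]
        drawn.append((cost(X), X))
    drawn.sort(key=lambda pair: pair[0])
    return drawn[:keep]


def toy_cost(X):
    """Stand-in mismatch: number of query nodes not mapped to the target [0, 1]."""
    return sum(1 for got, want in zip(X, [0, 1]) if got != want)


# With fully concentrated beliefs every draw returns the same mapping.
certain = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
sure = sample_mappings(certain, toy_cost, n_samples=5, keep=3)

# With spread-out beliefs the draws vary and are ranked by cost.
spread = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
ranked = sample_mappings(spread, toy_cost, n_samples=20, keep=5)
```

Because each query node is sampled independently, two query nodes may land on the same data node; the mismatch score is what penalizes such mappings.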

Conditional  sampling 

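To derive this method, we draw inspiration from collapsed Gibbs sampling. First, we rewrite the
mismatch of mapping $X$ as:

$$Q(X) = \sum_{k \in V^M} \sum_{i \in V^D} C_{ki}\, s_{ki} + \sum_{(k,m) \in E^M} \sum_{(i,j) \in E^D} C_{k,m;i,j}\, s_{ki}\, s_{mj} \tag{14}$$

where $s_{ki} = 1$ if $x_k = i$ and 0 otherwise. Accordingly, if a subset of the mapping variables
$X^{K-1} = \{x_1, x_2, \ldots, x_{K-1}\}$ is known ($K < N$), we can find the expected value of the conditional
mismatch as

$$
\begin{aligned}
E\big(Q(X \mid X^{K-1})\big) ={}& \sum_{k=1}^{K-1} C_{k,x_k} + \sum_{k=K}^{M} \sum_{i \in V^D} C_{ki}\, b_k(i) \\
&+ \sum_{\substack{(k,m) \in E^M \\ k<K,\ m \ge K}} \ \sum_{j:(x_k,j) \in E^D} C_{k,m;x_k,j}\, b_m(j)
 + \sum_{\substack{(k,m) \in E^M \\ k \ge K,\ m<K}} \ \sum_{i:(i,x_m) \in E^D} C_{k,m;i,x_m}\, b_k(i) \\
&+ \sum_{\substack{(k,m) \in E^M:\ (x_k,x_m) \in E^D \\ k<K,\ m<K}} C_{k,m;x_k,x_m}
 + \sum_{\substack{(k,m) \in E^M \\ k \ge K,\ m \ge K}} \ \sum_{(i,j) \in E^D} C_{k,m;i,j}\, b_k(i)\, b_m(j)
\end{aligned}
\tag{15}
$$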

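Moreover, we observe that conditioning on a specific value $x_K = i$ yields:

$$
\begin{aligned}
E\big(Q(X \mid X^{K-1}, x_K = i)\big) ={}& C_{K,i} + \sum_{k<K} C_{k,x_k} + \sum_{k>K} \sum_{j \in V^D} C_{kj}\, b_k(j) \\
&+ \sum_{k<K:(k,K) \in E^M} C_{k,K;x_k,i} + \sum_{m<K:(K,m) \in E^M} C_{K,m;i,x_m} \\
&+ \sum_{\substack{(k,m) \in E^M \\ k<K,\ m>K}} \ \sum_{j:(x_k,j) \in E^D} C_{k,m;x_k,j}\, b_m(j)
 + \sum_{\substack{(k,m) \in E^M \\ k>K,\ m<K}} \ \sum_{j:(j,x_m) \in E^D} C_{k,m;j,x_m}\, b_k(j) \\
&+ \sum_{\substack{(k,m) \in E^M:\ (x_k,x_m) \in E^D \\ k<K,\ m<K}} C_{k,m;x_k,x_m}
 + \sum_{\substack{(k,m) \in E^M \\ k>K,\ m>K}} \ \sum_{(i',j') \in E^D} C_{k,m;i',j'}\, b_k(i')\, b_m(j') \\
&+ \sum_{m>K:(K,m) \in E^M} \ \sum_{j:(i,j) \in E^D} C_{K,m;i,j}\, b_m(j)
 + \sum_{k>K:(k,K) \in E^M} \ \sum_{j:(j,i) \in E^D} C_{k,K;j,i}\, b_k(j)
\end{aligned}
\tag{16}
$$

Hence, we can compute a conditional sampling probability for variable $x_K$ from the terms of
Eq. (16) that depend on $i$, as follows:

$$
p_i = \Pr(x_K = i \mid X^{K-1}) \propto \exp\!\Bigg( -\bigg(
\begin{aligned}[t]
& C_{K,i} + \sum_{k<K:(k,K) \in E^M} C_{k,K;x_k,i} + \sum_{m<K:(K,m) \in E^M} C_{K,m;i,x_m} \\
& + \sum_{k>K:(k,K) \in E^M} \ \sum_{j:(j,i) \in E^D} C_{k,K;j,i}\, b_k(j)
 + \sum_{m>K:(K,m) \in E^M} \ \sum_{j:(i,j) \in E^D} C_{K,m;i,j}\, b_m(j) \bigg) \Bigg)
\end{aligned}
\tag{17}
$$

Then, for each $K \in V^M$ we generate the value $x_K \in V^D$ by drawing samples with probabilities
$p_i$ computed in Eq. (17).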

Belief  re-evaluation 

The sampling methods described above incorrectly use the global marginal posterior values in
place of the local marginals. That is, the posterior probabilities must be re-evaluated during the
mapping generation to obtain accurate estimates of $p_i = \Pr(x_K = i \mid X^{K-1}, D, M)$.

We can achieve this by “continuing” the belief propagation algorithm’s iterations over a small subset
of data points $\tilde{V}^D$ found as in Eq. (12) by thresholding the marginal posterior probabilities. The
updates substitute the mismatch values in Equations (5)-(7) as follows:

•  If we map $m \in V^M$ to data node $i \in V^D$, i.e. $x_m = i$, then

$$
C_{kj} \leftarrow
\begin{cases}
0, & \text{if } k = m \text{ and } j = i \\
\infty, & \text{if } k = m \text{ or } j = i \\
C_{kj}, & \text{otherwise}
\end{cases}
$$

Conducting 1-3 iterations of belief propagation will result in corrections to the $b_k(i)$ values,
which can be used in multinomial-based sampling to generate the remaining mapping values.
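The cost substitution used in belief re-evaluation can be sketched as a small helper that clamps the node mismatch table once a query node has been fixed. This is an illustration under assumptions: the function name `clamp_node_costs` is hypothetical, and the forbidden entries are represented by an infinite mismatch so that the already-mapped query node and the already-used data node cannot be selected again.

```python
import math


def clamp_node_costs(c_node, m, i):
    """Return a copy of the node mismatch table after fixing query node m to data node i."""
    clamped = [row[:] for row in c_node]
    for k in range(len(clamped)):
        for j in range(len(clamped[0])):
            if k == m and j == i:
                clamped[k][j] = 0.0       # the chosen pair becomes a perfect match
            elif k == m or j == i:
                clamped[k][j] = math.inf  # same query node or same data node: forbidden
    return clamped


c_node = [[1.0, 2.0], [3.0, 4.0]]
clamped = clamp_node_costs(c_node, m=0, i=1)
```

After clamping, a few further belief propagation iterations over the filtered data points would operate on `clamped` in place of the original node costs; the original table is left unchanged so it can be reused for other samples.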


3.2  Distributed Collaborative Querying Model

Now that we have defined and formalized a procedure for data querying using inexact graph
matching, we proceed to develop a mechanism that enables querying in a distributed yet
collaborative  fashion.  In  the  following  sections  we  will  provide  an  overview  of  the  distributed 
pattern  matching  problem  and  our  proposed  solution,  which  includes  a  formal  approach  to  task 
assignment  for  the  various  processing  units  (agents)  and  the  distributed  collaborative  formulation 
of  the  belief  propagation  algorithm. 

3.2.1  Problem  Formulation 

We assume that the data graph is partitioned into subgraphs $G^{D(u)}$ such that

$$G^D = \bigcup_{u \in U} G^{D(u)} \tag{18}$$


[Figure 4 labels: Query 1; Data; Accessibility; Results]

Figure 4: Data accessibility and query results across partitions

where $U$ is a set of distributed data stores with corresponding computational units, or agents $u$.
Specifically, we define the data graph partition based on the partition of the set of data nodes
$V^D$ into subsets $V^{D(u)}$, where $V^D = \bigcup_{u \in U} V^{D(u)}$, while the subsets may overlap and the results
to queries could be present across overlapping data subsets (Figure 4). Also, we assume that each
data subgraph $G^{D(u)} = (V^{D(u)}, E^{D(u)}, A^{D(u)})$ contains all the links associated with its nodes, i.e.
$E^{D(u)} = \{(i, j) \in E^D : i \in V^{D(u)}, j \in V^{D(u)}\}$, and that the attributes $A^{D(u)}$ contain a subset of the
attributes in $A^D$ for the corresponding nodes and links, that is $A^{D(u)} = \{a^{D(u)}_{ij} \subseteq a^D_{ij}\}$. For
example, the data may be segmented geographically (e.g. when the nodes are areas or tracks), in
which case the node subsets are non-overlapping: $V^{D(u)} \cap V^{D(v)} = \emptyset$ for $u \neq v$. Or the data may be
collected over the same areas or entities but by different sensors (e.g., GMTI captures the motion
of the objects, while LIDAR captures the characteristics of their 3D shapes), in which case the
attribute subsets are non-overlapping: $A^{D(u)} \cap A^{D(v)} = \emptyset$ for $u \neq v$.
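The node-based partition can be illustrated with a short sketch that builds each agent's subgraph from its node subset. The function names, the agent labels, and the choice to keep only links with both endpoints inside a subset are assumptions of this example.

```python
def induced_subgraph(nodes, data_edges):
    """Return the node set and the data links with both endpoints in that set."""
    nodes = set(nodes)
    edges = {(i, j) for (i, j) in data_edges if i in nodes and j in nodes}
    return nodes, edges


def partition_data_graph(data_nodes, data_edges, node_subsets):
    """Build one subgraph per agent; the node subsets must cover the whole data graph."""
    covered = set().union(*node_subsets.values())
    if covered != set(data_nodes):
        raise ValueError("node subsets do not cover the data graph")
    return {u: induced_subgraph(ns, data_edges) for u, ns in node_subsets.items()}


# Toy example: a five-node chain split between two agents that share node 2.
data_nodes = range(5)
data_edges = {(0, 1), (1, 2), (2, 3), (3, 4)}
parts = partition_data_graph(data_nodes, data_edges, {"a": {0, 1, 2}, "b": {2, 3, 4}})
overlap = parts["a"][0] & parts["b"][0]
```

With overlapping subsets, as here, a query match may straddle both agents through the shared node, which is the situation the collaborative message exchange is meant to handle.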

The distributed graph matching model, as described below, converts the query graph $G^M$ into the
tasks to assign to computation units $u \in U$.

3.2.2  Defining  tasks  for  computation  units  by  restructuring  the  factor  graph 

To define the local tasks for computational units $u \in U$, we make the factor graph such as in
Figure 5a more compact by splitting each factor node $(k, m)$ into two factors (Figure 5b) that
compute forward and backward messages. This allows us to define the query assignment task
graph for queries of any structure (Figure 5c shows an example of the task graph for the example
query in Figure 3a). A task for query node $m$ is defined by aggregating the computations of:

• map-log messages $\mu_m(i)$ from Eq. (5);

• messages from forward factors of all outgoing links, $[f_{(m,k)}(j), \forall k: (m, k) \in E^M]$, from
Eq. (6); and

• messages from backward factors of all incoming links, $[r_{(k,m)}(j), \forall k: (k, m) \in E^M]$, from
Eq. (7).


(a) Original factor graph  (b) Modified factor graph  (c) Query assignment task graph for example in Figure 3a

Figure 5: Design of the tasks for distributed computational units

Each task is assigned to one or more computational units, although in most settings an
assignment is made to a single unit.

3.2.3  The  case  of  complete  data  access 

In some situations the computational units $u \in U$ may have access to the same data, i.e., $G^{D(u)} = G^D$.
In this case, we define and assign the tasks over all data nodes, $i \in V^D$, $j \in V^D$, to compute all
the mapping and forward-backward messages. Each such task corresponds to a single node
$m \in V^M$ in the query graph. The assignment of tasks can then be conducted based on the
capabilities of computational units. This assignment is equivalent to partitioning (or cutting) the
query graph into subgraphs, trading off the size of the subgraphs against the amount of
corresponding communication, which is on the order of the number of links separated by the
corresponding graph cuts (Levchuk and Pattipati, 2013; Figure 6).
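
The trade-off described above can be illustrated with a small sketch. It is not the partitioning method of the report; it only scores a given assignment by the two quantities named in the text: the number of query links cut between units (a proxy for communication) and the number of query nodes per unit (a proxy for workload). The function name `evaluate_partition` and the 4-node chain query are hypothetical.

```python
from collections import Counter

def evaluate_partition(query_links, assignment):
    """Return (cut_links, load_per_unit) for a query-node-to-unit assignment.

    query_links: iterable of directed links (k, m) between query nodes.
    assignment:  dict mapping each query node to the unit that owns its task.
    """
    # A link is "cut" when its endpoints are processed by different units;
    # each cut link implies forward/backward messages crossing units.
    cut_links = [(k, m) for k, m in query_links if assignment[k] != assignment[m]]
    # Workload proxy: number of query nodes (tasks) per unit.
    load = Counter(assignment.values())
    return cut_links, dict(load)

# Hypothetical 4-node chain query 1 -> 2 -> 3 -> 4 split across two units.
links = [(1, 2), (2, 3), (3, 4)]
balanced = {1: "u", 2: "u", 3: "v", 4: "v"}
interleaved = {1: "u", 2: "v", 3: "u", 4: "v"}
print(evaluate_partition(links, balanced))
print(evaluate_partition(links, interleaved))
```

Both assignments give each unit two query nodes, but the interleaved one cuts every link, so it would require more communication for the same workload balance.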


(c)  Bad  query  partitioning:  imbalanced 
workload  and  high  communication 


Figure  6:  Examples  of  query  partitioning  for  distributed  assignment 

3.2.4  Collaborative  distributed  belief  propagation 

Let us assume that computational units $u \in U$ have access to data nodes $V^{D(u)} \subset V^D$ and are
assigned to process query nodes $V^{M(u)} \subset V^M$. We introduce two notations (Figure 7):

• $\overrightarrow{V}^{D(u)}$ is the subset of data nodes connected with forward links from nodes $V^{D(u)}$, i.e.,

$\overrightarrow{V}^{D(u)} = \{j \in V^D : \exists i \in V^{D(u)}, (i,j) \in E^D\}$

• $\overleftarrow{V}^{D(u)}$ is the subset of data nodes connected with backward links to nodes $V^{D(u)}$, i.e.,

$\overleftarrow{V}^{D(u)} = \{j \in V^D : \exists i \in V^{D(u)}, (j,i) \in E^D\}$

Figure 7: Example of subsets of nodes connected to and from data nodes accessed by the computational unit

Then we can write the steps to execute belief propagation in a distributed manner:

• Step 1 - Initialization: each computational unit $u \in U$ initializes the beliefs and
messages:

o $\forall m \in V^{M(u)}$, uniformly initialize beliefs $\mu_m = \{\mu_m(i), \forall i \in V^{D(u)}\}$

o $\forall m \in V^{M(u)}$, uniformly initialize external messages:

■ $\forall i: (m, i) \in E^M$, uniformly initialize forward message $f_{(m,i)} = \{f_{(m,i)}(j), j \in \overrightarrow{V}^{D(u)}\}$, where $f_{(m,i)}(j) = -\ln|\overrightarrow{V}^{D(u)}|$

■ $\forall i: (i, m) \in E^M$, uniformly initialize backward message $r_{(i,m)} = \{r_{(i,m)}(j), j \in \overleftarrow{V}^{D(u)}\}$, where $r_{(i,m)}(j) = -\ln|\overleftarrow{V}^{D(u)}|$

• Step 2 - Belief update: each computational unit $u \in U$ updates beliefs based on its
internally computed messages and messages received from other agents, as described in
Eq. (5).

• Step 3 - External message generation: each computational unit $u \in U$ computes the
forward and backward messages based on its belief messages and the external messages received from
other units, as described in Eq. (6)-(7).

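• Step 4 - Communication: each computational unit $u \in U$ decides whether or not to
communicate each message. It can communicate to agent $v \in U$ the following
messages, restricted to the data nodes that are link-connected to its own nodes and accessible to $v$:

o Forward messages: $\{f_{(m,k)}(j), \forall j \in \overrightarrow{V}^{D(u)} \cap V^{D(v)}\}$, $(m,k) \in E^M$  (19)

o Backward messages: $\{r_{(k,m)}(i), \forall i \in \overleftarrow{V}^{D(u)} \cap V^{D(v)}\}$, $(k,m) \in E^M$  (20)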

The units iterate steps 2-4 until the maximum number of iterations is reached or the convergence
criteria are met. These four steps define a collaborative distributed data analysis because the
units influence each other's belief computations using communicated forward and backward
messages. These messages encode the influence that one unit tries to exert on the beliefs of
another unit.

After the belief propagation process is completed at some time $T$, the units send their beliefs $\mu_m^T$
to a unit responsible for generating a final solution. That unit proceeds by combining received messages,
computing marginal mapping probabilities using Eq. (10), and generating query outputs as
described in Section 3.1.3.
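
The bookkeeping behind the communication step can be sketched as follows. This is an illustration under an assumption, not the report's implementation: it assumes that a unit only sends message entries for data nodes that are both link-connected to its own nodes and accessible to the receiving unit. All function names and the small example graph are hypothetical.

```python
def forward_neighbors(data_links, local_nodes):
    """Data nodes j reachable by a forward link (i, j) from some local node i."""
    return {j for i, j in data_links if i in local_nodes}

def backward_neighbors(data_links, local_nodes):
    """Data nodes j that have a link (j, i) into some local node i."""
    return {j for j, i in data_links if i in local_nodes}

def entries_to_send(message, neighbor_set, receiver_nodes):
    """Keep only message entries for data nodes the receiving unit can access."""
    keep = neighbor_set & receiver_nodes
    return {j: value for j, value in message.items() if j in keep}

# Hypothetical data graph and node subsets for two units u and v.
data_links = [(1, 2), (2, 3), (3, 4), (5, 1)]
nodes_u, nodes_v = {1, 2}, {3, 4, 5}
fwd_u = forward_neighbors(data_links, nodes_u)
bwd_u = backward_neighbors(data_links, nodes_u)
forward_msg = {j: -1.0 for j in fwd_u}      # placeholder log-domain values
print(fwd_u, bwd_u, entries_to_send(forward_msg, fwd_u, nodes_v))
```

In this example unit u computes forward values for data nodes 2 and 3, but only the entry for node 3 is relevant to unit v.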


3.3  Scalability  Improvements  to  Collaborative  Query  Model 

The complexity of collaborative querying using inexact graph matching can be further reduced by
limiting local computations and reducing communication requirements. The number of local
computations can be reduced at every iteration if the computational units process a subset of
accessible data nodes. Communication requirements can be reduced by compressing the number
of forward/backward messages. We provide the details of these improvements in the following
sections.

3.3.1  Data  Prioritization 

The computational complexity of the original distributed belief propagation algorithm, computed
as the number of summation and maximization operations, for both belief updates and message
generation, per each model node and link, is equal to $O(\max(|E^{D(u)}|, |V^{D(u)}|))$. While this
means complexity is linear in the size of the data, many of the computations are unnecessary
since the number of relevant nodes is small. We can reduce this complexity by orders of
magnitude if we sort the data nodes using a priority measure and select a subset of nodes
$V^{D(u,t)} \subseteq V^{D(u)}$ with highest priorities. We define a priority of data node $i \in V^{D(u)}$ using normalized belief
estimates maximized over model nodes:

$p^{(t)}(i) = \max_m \hat{\mu}_m^{(t)}(i)$  (21)

The notation $\hat{\mu}_m^{(t)}(i)$ represents the normalized belief estimates:

$\hat{\mu}_m^{(t)}(i) = \mu_m^{(t)}(i) - \log S_i$  (22)

These messages are initialized together with belief messages: $\hat{\mu}_m^{(0)}(i) = -C_{mi} - \log S_i$
(we change initialization from uniform to incorporating the mismatch values), and accordingly
priorities are initialized as $p^{(0)}(i) = \max_m \{-C_{mi} - \log S_i\}$. Normalization is essential for
correct updates. The higher the priority $p^{(t)}(i)$ is, the more probable at time $t$ it is that the data
node $i$ matches at least one of the query nodes assigned to the agent.

After data node priorities are initialized, we select the subset of nodes with highest priorities
$V^{D(u,t)}$ and construct the heap to store the priorities of the remaining nodes. At every
iteration, as the new marginal posterior probability $\hat{\mu}_m^{(t+1)}(i)$ is computed for node $i \in V^{D(u,t)}$, we
update the priority $p^{(t+1)}(i)$ according to Eq. (21) and compare it with the root (best value) of the
heap. If the updated priority is smaller, the node $i$ is removed from $V^{D(u,t)}$ and its value
$p^{(t+1)}(i)$ is added to the heap, while the root of the heap $j$ is added to the “relevant node set”.

We can formally write the update of the relevant set as:

$V^{D(u,t)}[0] = V^{D(u,t)}$

$V^{D(u,t)}[k+1] = \begin{cases} (V^{D(u,t)}[k] \setminus \{i_k\}) \cup \{j_k\}, & \text{if } p(i_k) < p(j_k) \\ V^{D(u,t)}[k], & \text{otherwise} \end{cases}$  (23)

$V^{D(u,t+1)} = V^{D(u,t)}[|V^{D(u,t)}|]$

where $p(\cdot)$ denotes the most recently computed priority and $j_k$ is a root of the heap structure for nodes $\{j \in V^{D(u)} \setminus V^{D(u,t)}[k]\}$. Even considering the added
computational complexity of heap updates, the reduction from processing the set $V^{D(u)}$ to a subset of
nodes $V^{D(u,t)}$ provides an equivalent reduction in the computational complexity of updates in equations
(5-7).
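
The swap rule between the relevant set and the heap can be sketched with Python's standard `heapq` module. This is a minimal illustration, assuming scalar priorities and a single pass over the relevant set per iteration; the function name `update_relevant_set` and the example priorities are hypothetical, and priorities are stored negated because `heapq` implements a min-heap.

```python
import heapq

def update_relevant_set(relevant, heap, priority):
    """Apply the swap rule once to every node in the current relevant set.

    relevant: set of data nodes currently being processed.
    heap:     list of (-priority, node) pairs for the remaining nodes;
              priorities are negated because heapq is a min-heap.
    priority: dict node -> most recently computed priority.
    """
    for node in sorted(relevant):          # fixed order keeps the sketch deterministic
        if not heap:
            break
        neg_best, best_node = heap[0]      # root = highest-priority waiting node
        if priority[node] < -neg_best:
            # Demote `node` into the heap and promote the root in one operation.
            heapq.heapreplace(heap, (-priority[node], node))
            relevant.remove(node)
            relevant.add(best_node)
    return relevant

priority = {"a": 0.9, "b": 0.2, "c": 0.5, "d": 0.7}
relevant = {"a", "b"}
heap = [(-priority[n], n) for n in ("c", "d")]
heapq.heapify(heap)
relevant = update_relevant_set(relevant, heap, priority)
print(sorted(relevant), heap[0])
```

Each swap costs one logarithmic-time heap operation, which is small relative to the savings from running the belief and message updates only over the relevant subset.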


3.3.2  Communication  Compression 


The number of values in communicated forward and backward messages from Eq. (19)-(20),
equal in the original formulation to $|\overrightarrow{V}^{D(u)} \cap V^{D(v)}|$ and $|\overleftarrow{V}^{D(u)} \cap V^{D(v)}|$ respectively, can be large, especially
when all computational units have access to the same data. However, not all of these messages
are relevant. For example, Figure 8 shows how values of forward belief messages change over
time: the support of relevant values is gradually reducing. Accordingly, we first use simple
filtering by communicating only a subset of forward/backward messages with the largest values. The
messages that are not received are replaced with lower-bound values. Second, we reduce the
number of communicated messages by quantizing the message value vector. Quantization uses
the hierarchical clustering of the data nodes (Figure 9b) and makes the decision to send a single
value for the group of nodes from a cluster, or all individual values. These decisions are
generated in a bottom-up manner using the metric of entropy at the cluster level (Figure 9c).
Using the definition of belief messages, for a cluster $c$ and the set of data nodes in this cluster $V^c$,
we compute the entropy for forward and backward messages as
$H^f_c(m,k) = -\sum_{j \in V^c} f_{(m,k)}(j) \cdot e^{f_{(m,k)}(j)}$ and $H^r_c(m,k) = -\sum_{j \in V^c} r_{(m,k)}(j) \cdot e^{r_{(m,k)}(j)}$, respectively. Consequently, a compressed
vector represents a quantization of the belief message.
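
The filtering step and the cluster-level entropy can be sketched as below. This is an illustration only: it treats message values as log-probabilities, and the function names, the choice of k, and the lower-bound value are assumptions made for the example. The rule that turns the entropy into a send-one-value versus send-all decision is not specified here and is therefore left out.

```python
import math

def top_k_filter(message, k):
    """Sender side: keep the k largest log-domain values (dict node -> value)."""
    kept = sorted(message, key=message.get, reverse=True)[:k]
    return {j: message[j] for j in kept}

def restore(filtered, all_nodes, lower_bound):
    """Receiver side: fill entries that were not sent with a lower-bound value."""
    return {j: filtered.get(j, lower_bound) for j in all_nodes}

def cluster_entropy(message, cluster):
    """-sum_j f(j) * exp(f(j)) over the cluster, treating f(j) as log-probabilities."""
    return -sum(message[j] * math.exp(message[j]) for j in cluster)

# Hypothetical forward message over ten data nodes.
values = [-10, -10, -5, -5, -1, -1, -1, -1, -1, -1]
forward = {j: float(v) for j, v in enumerate(values, start=1)}
sent = top_k_filter(forward, k=6)
received = restore(sent, forward.keys(), lower_bound=-10.0)
print(sorted(sent), received[3], cluster_entropy(forward, [5, 6]))
```

Here only six of the ten values are transmitted, and the receiver substitutes the lower bound for the four that were dropped.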


Figure 8: Example of changes in the values of forward messages over iterations

3.3.3  Complexity  Reduction  Using  Question-Answer  Process 

One  of  the  challenges  in  collaborative  data  querying  is  the  need  for  computational  resources  to 
communicate  significant  amounts  of  information  to  each  other.  While  this  information  can  be 
filtered  and  compressed,  there  will  still  be  a  significant  amount  of  irrelevant  information  that  one 
unit  may  send  to  another.  To  avoid  this  problem,  as  well  as  allow  units  to  better  discriminate 
seemingly  equivalent  entities/nodes  in  their  local  data,  units  can  employ  a  question-answer 
process  that  will  allow  them  to  request  information  of  interest  rather  than  push  this  information 
to  other  units.  Figure  10  shows  an  example  of  this  process  in  the  case  of  a  single  model  node 
assigned  to  each  of  two  units.  Each  unit  has  computed  a  current  estimate  of  node-to-node 
associations between model and data graphs. Then the sets of high-scoring mappings will be
limited, e.g., data nodes {2,5} and {5,8} for agents u and v respectively (dotted red arrows in
Figure 10). Then, instead of communicating all computed forward/backward messages, each
agent asks a question about its own nodes of interest and receives only the messages
corresponding to these nodes. This results in a significant reduction in the amount of
communication between the agents (4x reduction in the example depicted in Figure 10, where
the communication over dashed directions is not needed).

(a) Example data and query graphs

link   type       Data Nodes
                    1    2    3    4    5    6    7    8    9   10
1->2   forward    -10  -10   -5   -5   -1   -1   -1   -1   -1   -1
1->2   backward    -5   -5   -5   -1  -10  -10   -1  -10  -10  -10

(c) Original and compressed forward $f_{(1,2)}$ and backward $r_{(1,2)}$ message values (the original values have duplicates; compression ratio = 3/10)

Figure 9: Example of belief message compression

Figure 10: Question-answer process can reduce number of communicated belief messages (only the requested messages need to be communicated)
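
A minimal sketch of the question-answer exchange follows. It assumes that a "question" is simply the set of data nodes a unit currently scores highest, and that the "answer" is the subset of the other unit's message restricted to those nodes; the function names `ask` and `answer` and the numeric values are hypothetical.

```python
def ask(beliefs, top_n):
    """Question: the data nodes this unit currently scores highest."""
    return set(sorted(beliefs, key=beliefs.get, reverse=True)[:top_n])

def answer(message, question):
    """Answer: only the message entries for the requested data nodes."""
    return {j: message[j] for j in question if j in message}

# Unit u is interested in data nodes 2 and 5; unit v holds a full message.
beliefs_u = {j: -10.0 for j in range(1, 11)}
beliefs_u[2], beliefs_u[5] = -1.0, -0.5
message_from_v = {j: -float(j) for j in range(1, 11)}
question = ask(beliefs_u, top_n=2)
reply = answer(message_from_v, question)
print(sorted(question), reply)
```

In this example unit v returns two values instead of ten, which is the pull-based saving the text describes.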


3.4  Collaborative  Pattern  Learning 

Oftentimes structured queries may be weakly designed due to the lack of understanding of
phenomena present in the dataset. Also, users may be interested in finding normal (frequent) and
abnormal patterns occurring in the dataset under investigation. This can be achieved by
distributed collaborative pattern learning.

3.4.1  Types  of  Learning 

Depending on what observed data is accessible by which computational units, we can distinguish
two classes of learning: compositional learning and unified learning (Figure 11). In
compositional learning the units locally learn frequent patterns of node connections in the data
they have access to, and then combine these patterns into higher-level “composite” patterns
constructed by stitching learned pattern graphlets (Figure 11a). In unified learning, units may
learn the patterns of the same type, and then fuse these together to come up with more general
frequent graphs (Figure 11b). In this work, we focused on compositional learning, since in many
of the problems we encounter, computational units have access to distinct sensor modalities.
Accordingly, using a compositional approach to pattern learning would be more appropriate due
to the existence of different connections across diverse sensor modalities.


(a) Compositional pattern learning  (b) Unified pattern learning

Figure 11: Different types of pattern learning


3.4.2  Learning  Patterns  from  Graph  Instances 

In  previous  work,  we  developed  a  model  for  learning  frequent  graph  patterns  in  the  relational 
data  (Levchuk,  Roberts,  and  Freeman,  2012).  Our  pattern  learning  algorithm  was  based  on 
segmenting  the  graph  into  the  subgraphs  based  on  topological  matching.  The  training  corpus  was 
created  by  clustering  these  subgraphs  based  on  the  attributed  graph  similarity,  creating  the 
“instances”  from  which  the  pattern  is  learned  using  a  variant  of  expectation  maximization  (EM) 
algorithm  (Dempster,  Laird,  and  Rubin,  1977)  that  performs  iterative  update  of  model-to-data 
mappings and model’s attributes. For this work, we modified the learning algorithm to avoid
computing exact mappings at every iteration, and developed a distributed version of the learning
model.


Formally, the attributes of the frequent graph pattern $A^M = \{a^M_{km}\}$ are learned from a set of $T$
observations of subgraphs $G^D(t) = (V^D(t), E^D(t), A^D(t))$ with attributes $A^D(t) = \{a^D_{ij}(t)\}$,
which we call a learning corpus. We estimate the attributes of the pattern $M$ that most
likely generated such data by maximizing the posterior probability of the pattern given its observed
instances (in the equations below we use notations of attributes $A$, instead of graphs $G$, for both
model and data under the probability functions without loss of generality and for clarity of the
exposition):

$\hat{A}^M = \arg\max_A P(A \mid A^D(t), t = 1, \ldots, T)$  (24)

We  use  a  variant  of  the  EM  algorithm  that  treats  mappings  as  unobserved  variables  and 
iteratively  updates  pattern  attribute  parameters.  We  modified  EM  from  working  with  the 
likelihood  function  to  using  a  posterior  distribution.  This  was  dictated  by  the  need  to  avoid 
dealing  with  the  large  number  of  components  of  the  objective  function  in  the  expectation  step. 
The  algorithm  proceeds  in  two  iterative  steps: 


• In the expectation step, also called the mapping updating step, we find components of the
distribution $p_t = P(S^t \mid A_n, A^D(t))$, so that the expected value of the log-posterior function
can be calculated: $Q(A \mid A_n) = \sum_t E_{p_t}[\log P(A, S^t \mid A^D(t))]$.

• In the maximization step, also referred to as the parameter updating step, we compute new
graph pattern attributes $A_{n+1} = \{a^{M(n+1)}_{km}\}$ to maximize the conditional expected log
posterior: $A_{n+1} = \arg\max_A Q(A \mid A_n)$.


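The expectation step is equivalent to conducting iterations via Eq. (5)-(7), to obtain estimates of
node and link probabilities, i.e., beliefs and forward-backward message values
$\mu_m^{(n,t)}(i)$, $f_{(m,k)}^{(n,t)}(j)$, $r_{(k,m)}^{(n,t)}(j)$, for mapping at step $n$ a model graph with current attributes
$A_n = \{a^{M(n)}_{km}\}$ to the data graph with attributes $A^D(t)$.

[Equations (25)-(27): the belief, forward-message, and backward-message updates of Eq. (5)-(7), written in terms of the current model attributes $a^{M(n)}_{mm}$, $a^{M(n)}_{mk}$ and the data attributes $a^{D(t)}_{ii}$, $a^{D(t)}_{ij}$.]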


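To complete the maximization step, for the case of Gaussian error of attribute generation, we
simplify the log-posterior probability computation:

$\log P(A, S^t \mid A^D(t)) \sim -\Big[ \sum_{k \in V^M, i \in V^D(t)} \|a^M_{kk} - a^{D(t)}_{ii}\|^2 s_{ki} + \sum_{(k,m) \in E^M, (i,j) \in E^D(t)} \|a^M_{km} - a^{D(t)}_{ij}\|^2 s_{ki} s_{mj} \Big]$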

Next,  the  expected  value  of  the  sum  is  equal  to  the  sum  of  expected  values  of  components,  hence 
we  obtain: 


$Q(A \mid A_n) = \sum_t E_{p_t}[\log P(A, S^t \mid A^D(t))]$

Then, the maximization step results in weighted averaging over mapped attributes of nodes and
links:

• Node attributes update: $a^{M(n+1)}_{kk} = \sum_t \sum_{i \in V^D(t)} \ldots$  (28)

• Link attributes update: $a^{M(n+1)}_{km} = \sum_t \sum_{(i,j) \in E^D(t)} \ldots$  (29)
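
As an illustration of what "weighted averaging over mapped attributes" can look like for a single model node, the sketch below averages scalar data node attributes with weights obtained by exponentiating log-domain beliefs. The weighting scheme, the function name `weighted_node_attribute`, and the scalar attributes are assumptions made for this example; they are not taken from Eq. (28).

```python
import math

def weighted_node_attribute(log_beliefs, data_attrs):
    """Belief-weighted average of data node attributes for one model node.

    log_beliefs: dict data node -> log-domain belief for the model node.
    data_attrs:  dict data node -> scalar attribute value.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_beliefs.values())
    weights = {i: math.exp(b - top) for i, b in log_beliefs.items()}
    total = sum(weights.values())
    return sum(weights[i] * data_attrs[i] for i in weights) / total

# Two equally likely matches (nodes 1 and 2) and one very unlikely match (node 3).
log_beliefs = {1: 0.0, 2: 0.0, 3: -50.0}
data_attrs = {1: 2.0, 2: 4.0, 3: 100.0}
learned = weighted_node_attribute(log_beliefs, data_attrs)
print(learned)
```

The unlikely node contributes almost nothing, so the learned attribute lands close to the mean of the two probable matches.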


Instead of computing the exact estimates in the expectation step, we only need their approximate
values; given the incremental changes to the learned pattern, this assumption is warranted.
Hence, we conduct a few (2 to 3) iterations of the SLBP algorithm to obtain values
$\mu_k^{(n+1,t)}(i)$, $r_{(k,m)}^{(n+1,t)}(i)$, $f_{(k,m)}^{(n+1,t)}(j)$ initialized with the previous values $\mu_k^{(n,t)}(i)$, $r_{(k,m)}^{(n,t)}(i)$, and
$f_{(k,m)}^{(n,t)}(j)$. Thus, the complexity of attributed graph learning remains similar to the original
SLBP, with only the increase in the number of iterations and the introduction of the computations
from equations (28)-(29), which are of similar complexity to SLBP iterations. However, this
increase may be significant when the initial estimates of the model graph and/or model-data
mapping variables are inaccurate.


It was shown in (Yedidia, Freeman, and Weiss, 2004) that the belief propagation (BP) algorithm
minimizes the “free energy” function, which is the difference between the mismatch between
mapped attributes of model and data graphs and the entropy of the marginal mapping
distribution. This means that the BP algorithm tries to trade off finding multiple matches to the
model graph in the data against assuring these are “good” matches. This is exactly the property we
desire in learning the graph patterns from data. For the case of “mega-example” training data,
i.e., a single large data graph without segmentation into “graph instances”, the equations above
will still hold if the index $t$ is dropped. Accordingly, without loss of generality, we
remove this notation in the following section.


3.4.3  Collaborative  Distributed  Learning  Model 


The learning algorithm described in the previous section represents centralized pattern learning
that can be executed by a single unit that has access to all nodes, links, and attributes of the data
graph. We extended this algorithm to a distributed case, where the unit $u$ with access to data
subgraph $G^{D(u)}$ is maintaining and updating beliefs and messages $\mu_k(i)$, $r_{(k,m)}(i)$, $f_{(k,m)}(j)$. At every
step, the following updates are executed by unit $u$:

• Update model attributes $a^{M(u)}_{kk}$ for nodes $k \in V^{M(u)}$ assigned to unit $u$.

• Update model attributes $a^{M(u)}_{km}$ for links $(k, m)$ between model nodes assigned to $u$, summing over the local data links $(i,j) \in E^{D(u)}$ with attributes $a^{D(u)}_{ij}$.

• Update model attributes $a^{M(u,v)}_{km}$ for links $(k, m)$ between model nodes outgoing from
agent $u$ to agent $v$, summing over data links with $i \in V^{D(u)}$, $j \in V^{D(v)}$ and attributes $a^{D(u,v)}_{ij}$. This update combines the backward message $r^{v \to u(n)}_{(k,m)}(i)$ received from unit $v$ with the forward message $f^{u \to v(n)}_{(k,m)}(j)$ computed at unit $u$.

In addition, the unit $u$ will send unit $v$ the learned attributes of the link $(k, m) \in E^M$
between model nodes $k \in V^{M(u)}$ and $m \in V^{M(v)}$ that originated in $u$. Schematically, the set of
variables to perform pattern learning is depicted in the example in Figure 12. As we can see,
pattern learning differs from pattern matching by the need to update model graph attributes
(which are fixed in the matching process) based on the same beliefs and messages computed and
communicated in collaborative matching.


Figure 12: Example of variables and communications in distributed pattern learning algorithm

The only additional communication needed in the pattern learning model is that of attributes of
the model links among model graph nodes controlled and learned by different computational
resources. But this communication can be avoided if the receiving agent $v$ computes the attribute
itself, using the same update with the roles of the two units exchanged: the backward message is
computed by unit $v$, and the forward message is received from unit $u$.

While this alternative insignificantly increases the computations performed at agent nodes and
creates duplicate attributes (for model links between agents), it reduces
communication between agents. Note, however, that the total run time, and consequently the amount
of information communicated between the agents, is higher in pattern learning applications due
to the larger number of iterations needed for convergence.


3.4.4  Learning  Frequent  Patterns 

Learning  multiple  frequent  subgraph  patterns  in  the  data  may  occur  naturally:  the  final  model 
graph  learned  collaboratively  by  a  group  of  units  will  include  multiple  disconnected  components. 
Each  component  could  be  declared  a  pattern  and  reported  to  the  users.  In  general,  this  is  not 
guaranteed  by  the  model  described  above.  Figure  13a  depicts  an  example  where  the  same  “local” 
pattern  at  one  unit  is  connected  differently  to  another  unit.  Learning  one  unified  pattern  will 
result  in  ambiguity  of  the  actual  structure.  Instead,  we  need  to  enable  agents  to  generate  multiple 
hypotheses  of  connections  between  substructures  learned  at  different  agents.  This  can  be  done  by 
collapsing  the  pattern  instances  learned  locally  by  each  unit  into  a  single  node,  and  then 
performing information propagation of pattern labels among the units. This will create a “multi-stage”
learning process, where the computational units first learn the subgraphs in their local
data,  aggregate  their  instances,  and  then  learn  the  higher-level  “compositional”  patterns  of 
connections  between  such  structures  (Figure  13b). 

One  of  the  crucial  steps  in  both  centralized  and  distributed  pattern  learning  is  the  initialization. 
During  this  step  the  computational  units  must  decide  about  the  following: 

(a)  How  many  model  nodes  are  possible  for  them  to  control? 

(b)  What  are  feasible  links  between  these  nodes? 

(c)  How  to  initialize  node  and  link  attributes? 


(a) Phase 1: Learn local patterns, aggregated instances  (b) Convert learned instances into nodes  (c) Phase 2: Find consistent compositions

Figure 13: Learning patterns in distributed manner

To solve these problems, we proceed in the following steps. First, we generate summary
attributes at a node that describe the neighborhood of the node. Second, we aggregate the nodes
into groups using standard clustering or classification algorithms (such as mixture of Gaussians)
over summary attributes. Third, we aggregate the nodes in the same class (or cluster), and extract
frequent topological subgraphs that are then used to initialize multiple patterns and their
corresponding attributes and node-to-node matching probabilities.


3.5  Software  Implementation 
3.5.1  Software  architecture 


We started implementation of the DSPACE solution by developing a modular architecture, the main
components of which are shown in Figure 14:

• User interface allows defining the queries and displaying the results

• Query processor segments the query and assigns the analysis tasks to agents

• BSP controller synchronizes the agents’ operations by calling the “supersteps” and
generating the final analysis results

• Processing units are individual agents with access to segments of the data, which execute
belief estimation computations and collaborate by generating, sending, and incorporating
belief messages

• Messaging system supports delivery of the messages between the agents in any
environment

Figure 14: SW architecture of DSPACE system

We tested several messaging frameworks and concluded that the ActiveMQ implementation of the
Java Message Service (JMS), together with Protocol Buffers, provides the functionality
needed for collaborative data analysis. Protocol Buffers are a way of encoding structured data in
an efficient yet extensible format. Google uses Protocol Buffers for almost all of its internal RPC
protocols and file formats.

Each of the main components contained a set of sub-components, depicted in Figure 15 and
described in more detail in the following sections.

(a) User Interface (UI)  (b) Query Processor (QP)  (c) BSP Controller (BSP)  (d) Processing Unit (PU)

Figure 15: Subcomponents of the DSPACE system

3.5.2  Query  processor 

This  component  executes  the  following  functions: 

•  Analyze  the  query 

•  Partition  the  query  into  the  subsets  of  the  EEIs  assigned  to  different  processing  units  (PU) 
using  information  about  PU’s  capabilities  and  accessible  data 

•  Aggregate  the  results  from  multiple  PUs 

•  Create  a  response  output  and  return  results  to  users 

The  query  processor  (QP)  sends  the  query  partition  information  and  infrastructure  and  algorithm 
parameters  to  the  BSP  Controller.  QP  will  receive  the  responses  from  PUs  in  the  form  of  belief 
(probability)  messages  for  the  query  partitions  assigned  to  PUs. 

3.5.3  BSP  controller 


This  component  performs  the  following  functions: 

•  Maintain  directory  of  available  processing  units  and  the  query  parts  they  can  process 

•  Send  assignment  of  query  partitions  to  the  PUs 

• Synchronize the distributed data analysis by sending and receiving control messages, which
include the start and status of the “supersteps”

• Receive computation metrics such as computation time and workload

•  Receive  the  query  results  from  PUs 

•  Send  all  query  results  to  query  processor 

3.5.4  Processing  unit 

The processing unit (PU) implements the “agents”: the distributed computation resources that have
data access and autonomously collaborate by sending belief messages. PUs implement the
following functions:

• Initialize for task

o Compute mismatch(es)
o Initialize beliefs

• Loop for superstep

o Update beliefs
o Generate external messages (with optimization)
o Send messages
o Vote done on superstep (pause for coordination)
o Check for done from controller (if done, exit loop)
o Receive messages for next superstep
o Search (identify relevant nodes)

• Return results

o Send results back to controller
o Send the state of the agent back to controller

The PUs maintain the knowledge of agent organization, i.e., a directory of what other
processing units are available, what data they have access to, which ones are connected, and
what dependencies exist between them. The PUs implement the distributed data analysis algorithm
based on Belief Propagation for graph matching. They also include message optimization (using
filtering and compression methods), and the components to compute the metrics of performance
(accuracy of results) and process (time and workload logging).
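
The superstep loop above can be illustrated with a single-process toy simulation. It is not the DSPACE prototype: there is no JMS transport, and the "belief" is a single number that each unit averages with the values it receives, standing in for the actual belief propagation updates. The class and function names are hypothetical.

```python
class ProcessingUnit:
    """Toy agent: its 'belief' is a number that it averages with received values."""

    def __init__(self, name, belief):
        self.name, self.belief, self.inbox = name, belief, []

    def superstep(self):
        # Update beliefs from messages delivered at the previous barrier.
        if self.inbox:
            self.belief = (self.belief + sum(self.inbox)) / (1 + len(self.inbox))
            self.inbox = []
        # Generate the external message for the next superstep.
        return self.belief


def run_bsp(units, max_supersteps=50, tol=1e-6):
    """Controller loop: compute, exchange at the barrier, stop when stable."""
    for step in range(1, max_supersteps + 1):
        before = [u.belief for u in units]
        outgoing = {u.name: u.superstep() for u in units}    # compute phase
        for u in units:                                       # communication phase
            u.inbox = [v for name, v in outgoing.items() if name != u.name]
        change = max(abs(u.belief - b) for u, b in zip(units, before))
        if step > 1 and change < tol:                         # all units "vote done"
            break
    return step, [u.belief for u in units]

steps, beliefs = run_bsp([ProcessingUnit("u", 0.0), ProcessingUnit("v", 1.0)])
print(steps, beliefs)
```

The point of the sketch is the control flow: every unit finishes its local computation before any message is delivered, and the controller decides when to stop based on what the units report.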

3.5.5  Performance  and  system  metrics 

All of the components of DSPACE have the ability to post metrics data to the metrics controller
and into the metrics database. When the DSPACE system is processing queries in test mode, it
collects metrics data using the following steps:

• The DSPACE component calls the logger with an event type and a message. Examples of
event types are Query_Submitted, Query_Returned, QueryPartitioner_Started, and
QueryPartitioner_Completed.

• The logger gets the log request and adds some context information, such as where the log
message is coming from, the time of the log message, and other identifying information such
as the id of the query that is being processed.

• If the component issuing the log message is remote from the metrics controller, the message is sent to
the messaging system, which transports the message to the metrics controller. The message
passing is done using the same message passing system (ActiveMQ) that is used by the query processor to
transport messages between processing units and the other controllers.

• The metrics controller receives a log message (either from a local component or a remote
component) and commits it to the metrics database.
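
The logging steps above can be sketched as follows. This is a simplified illustration: a Python list stands in for the metrics database, a plain callable stands in for the messaging system, and the class names and record fields are hypothetical rather than taken from the DSPACE code.

```python
import time

class MetricsController:
    """Collects log records; a list stands in for the metrics database."""

    def __init__(self):
        self.database = []

    def receive(self, record):
        self.database.append(record)


class MetricsLogger:
    def __init__(self, component, controller, send_remote=None):
        self.component = component
        self.controller = controller
        # send_remote stands in for the messaging system used by remote components.
        self.send_remote = send_remote

    def log(self, event_type, message, query_id=None):
        record = {
            "event_type": event_type,
            "message": message,
            "component": self.component,     # where the message comes from
            "time": time.time(),             # when it was logged
            "query_id": query_id,            # which query is being processed
        }
        if self.send_remote is not None:
            self.send_remote(record)         # remote component: go through messaging
        else:
            self.controller.receive(record)  # local component: hand over directly
        return record

controller = MetricsController()
local = MetricsLogger("QueryProcessor", controller)
remote = MetricsLogger("ProcessingUnit-1", controller, send_remote=controller.receive)
local.log("Query_Submitted", "query accepted", query_id=7)
remote.log("QueryPartitioner_Started", "partitioning", query_id=7)
print([r["event_type"] for r in controller.database])
```

Keeping the context fields in the logger rather than in each component means every record carries the same identifying information regardless of where it originates.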

At  any  time,  a  user  can  query  the  metrics  data  and  generate  charts  of  DSPACE  performance. 

The metrics collection components are extensible, and additional test and experiment metrics can
be collected and stored in the metrics database. Currently, timing data is being collected for
query runs of different data and model sizes. Communication size metrics are also being
collected. The types of charts that can be generated are also extensible; currently, charts of query
execution times for different data and model sizes and charts of communication times have been
implemented (these can be re-run at any time with new or additional run data to generate new
charts).

3.5.6  User  interfaces 

We  started  implementation  of  DSPACE’s  user  interfaces  (UIs)  by  identifying  the  list  of  actions 
that  users  must  be  allowed  to  perform  with  the  tool,  including: 

• Input the query/pattern by selecting a file containing the corresponding pattern, or by designing the pattern in graphical form

• Select the data

•  Generate  the  data  for  experiments  (randomly,  or  from  a  given  dataset) 

•  Configure  the  research  parameters  (define  parameters  such  as  the  number  of  agents, 
configure  the  analysis  algorithms,  specify  the  experiments  to  be  conducted,  etc.) 

•  Show  progress  of  the  query  (in  terms  of  the  time  remaining,  interactions  between 
distributed  collaborating  resources,  and  intermediate  results  statistics) 

•  Display  intermediate  algorithm  state  (belief-based  heatmap  over  data  network) 

•  Display  the  results 

•  Collect  status  and  analysis  metrics 

Accordingly,  DSPACE’s  UIs  included  the  following  GUI  elements: 

•  Dataset  and  Query  graph  editing 

•  Start/stop  buttons 

•  Progress  status  displays 

•  Query  results  display 

•  Measures  display 

The interface components for user interaction with the DSPACE system were designed using web services. We implemented three user interfaces:

(1) Engineering interface for the scalability experiments. This is a web-based interface that uses cytoscape.js (a JavaScript graph visualization library that runs in a browser) to visualize the network (graph) data and exposes the set of parameters that the experiment designer may change (Figure 16).

(2) Triple-store data connection interface for experiments with the Lehigh University Benchmark (LUBM) dataset (Figure 17).

(3) Metrics interface to visualize the performance of the system. Like the query test interface, the metrics user interface is web-based (Figure 18).

Figure 18 shows two graphs of query execution times for a variety of data and model sizes. This information was not collected under controlled conditions; although it is actual DSPACE performance data, it does not reflect meaningful performance results of DSPACE. The graphs are generated from data extracted from the metrics database and visualized in the browser using the Google Charts library. The top bubble chart shows the size of the data set (number of nodes) along the x-axis and the size of the query (number of nodes) along the y-axis. The size of each bubble indicates the execution time to search for a pattern of the given model size in a data set of the given data size. The color of the bubble also indicates execution time, ranging from red (shorter) to blue (longer). The second chart shows similar results, with each colored line representing a different model size (number of nodes in the query). In this chart, if there was no data for a data/model size combination, the result is shown as zero (for example, for a data set size of 2000, query execution time is shown only for a 3-node model query). All runs with the same data set size and model size are averaged to determine the value displayed at that point on the line.
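The aggregation behind the second chart can be illustrated with a short Python sketch. The function name, the input layout (data size, model size, seconds), and the sample run values are hypothetical; only the averaging rule and the zero default for missing combinations come from the description above.

```python
from collections import defaultdict


def average_execution_times(runs, data_sizes, model_sizes):
    """Average run times per (data size, model size); missing pairs become 0.0.

    Returns one list of points per model size, ordered like data_sizes,
    which is the shape needed to draw one line per model size.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (data, model) -> [sum, count]
    for data_size, model_size, seconds in runs:
        entry = totals[(data_size, model_size)]
        entry[0] += seconds
        entry[1] += 1

    series = {}
    for model_size in model_sizes:
        points = []
        for data_size in data_sizes:
            total, count = totals.get((data_size, model_size), (0.0, 0))
            points.append(total / count if count else 0.0)
        series[model_size] = points
    return series


# Hypothetical runs: (data set size, model size, execution time in seconds).
runs = [(1000, 3, 2.0), (1000, 3, 4.0), (1000, 5, 6.0), (2000, 3, 9.0)]
series = average_execution_times(runs, data_sizes=[1000, 2000], model_sizes=[3, 5])
```

Here the two runs for data size 1000 and model size 3 are averaged into one point, and the 5-node model has no run at data size 2000, so that point is plotted as zero.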

These user interfaces communicate with the query controller components via WebSockets, which allow the two-way communication needed for sending the query and receiving a response (Figure 19).

Figure 16: Engineering interface for experiments with synthetic data

Figure 17: Interface for connecting with LUBM triple store

Figure 19: Interactions between user interfaces, controller components, and the agents

4. RESULTS AND DISCUSSION


We conducted experiments with two kinds of synthetic data: randomly generated graphs, and the LUBM benchmark data, which was designed to resemble realistic triple-store datasets.


4.1 Experiments with randomly generated data

In the first set of experiments, we used randomly generated query and data graphs to validate three hypotheses. First, we showed that the computational complexity of our querying solution is linear in the size of the data, and that our solution is robust to data noise. Second, we showed that the distributed solution produces accuracy comparable to centralized graph matching, while providing further computational improvements on distributed peer-to-peer networks. Finally, we showed that the scalability improvements based on prioritization and belief message compression reduce both the local computations performed by the units and the communication among them.

To achieve these goals, we used the synthetic graph generator jgrapht (see footnote 1). We generated a set of queries, their instances, and the data graph. The instances, which are the true matches to the query that we want the algorithm to retrieve, were then embedded in the data graph, and noise was added to the attribute values (Figure 20). Using synthetic graph data allowed us to vary the size of the queries and data graphs, the number of true query matches in the data, the amount of noise in attributes, the graph density, the attribute value range, etc. The availability of ground truth allowed us to score the accuracy of our solution.
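The setup described above can be illustrated with a minimal Python sketch. It does not use jgrapht and is not the generator used in the experiments; the function names, the single numeric attribute per node, the directed edges, and the uniform noise model are all assumptions made for illustration.

```python
import random


def random_attributed_graph(n_nodes, n_edges, value_range, rng):
    """Random directed graph: {node: attribute value} plus a set of edges."""
    if n_edges > n_nodes * (n_nodes - 1):
        raise ValueError("too many edges requested for this number of nodes")
    nodes = {i: rng.uniform(*value_range) for i in range(n_nodes)}
    edges = set()
    while len(edges) < n_edges:
        a, b = rng.sample(range(n_nodes), 2)  # two distinct endpoints
        edges.add((a, b))
    return nodes, edges


def embed_instance(data_nodes, data_edges, query_nodes, query_edges, noise, rng):
    """Embed one noisy copy of the query in the data graph (modified in place).

    Returns the query-node -> data-node mapping, which serves as ground truth.
    """
    targets = rng.sample(sorted(data_nodes), len(query_nodes))
    mapping = dict(zip(sorted(query_nodes), targets))
    for q_node, d_node in mapping.items():
        # The instance carries the query attribute perturbed by bounded noise.
        data_nodes[d_node] = query_nodes[q_node] + rng.uniform(-noise, noise)
    for a, b in query_edges:
        data_edges.add((mapping[a], mapping[b]))
    return mapping


rng = random.Random(0)  # fixed seed so the experiment is repeatable
query_nodes, query_edges = random_attributed_graph(3, 3, (0.0, 10.0), rng)
data_nodes, data_edges = random_attributed_graph(50, 100, (0.0, 10.0), rng)
truth = embed_instance(data_nodes, data_edges, query_nodes, query_edges, 0.5, rng)
```

Because the mapping returned by embed_instance records exactly where the true match was placed, any retrieved match can later be scored against it.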


[Figure 20 panels: (a) Queries; (b) Instances; (c) Ground Truth; (d) Data; (e) Inferences; output: Performance Measures]

Figure 20: Experimental setup

4.1.1  Solution  scalability 

We assessed our algorithm using the computational run time (in seconds) of several stages of the algorithm, including:

• Map-log update: the time to update the belief messages in Eq. (5)

• Message generation: the time to generate the forward and backward messages $f_{(m,k)}(j)$ and $r_{(m,k)}(i)$ in Eq. (6)-(7)

To evaluate the impact of communication independently of the actual network, we computed the communication workload as the number of external message values that were generated and had to be communicated between the computational units. Finally, we computed retrieval accuracy scores, including recall, precision, and F-score.
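For reference, the three retrieval scores can be computed from the set of retrieved matches and the ground-truth set as in the sketch below. These are the standard definitions; the function name and the match identifiers are hypothetical, and the report does not state how a retrieved subgraph is judged to coincide with a true instance, so set membership is assumed here.

```python
def retrieval_scores(retrieved, ground_truth):
    """Return (precision, recall, F-score) for a set of retrieved matches."""
    retrieved, ground_truth = set(retrieved), set(ground_truth)
    true_positives = len(retrieved & ground_truth)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Four matches retrieved, three true instances, two of them found.
precision, recall, f_score = retrieval_scores(
    retrieved={"m1", "m2", "m3", "m4"},
    ground_truth={"m1", "m2", "m5"},
)
```

In this example precision is 2/4, recall is 2/3, and the F-score is 4/7.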


1 www.jgrapht.org

[Figure 21 panels: (a) Algorithm run time with respect to the size of the data (x-axis: # nodes in data graph); information prioritization (10%) provides an order-of-magnitude improvement in run time. (b) Detection performance (as max F-score) with respect to the size of the data (x-axis: # nodes in data graph; legend: Baseline); retrieval accuracy is comparable for different sizes of the data graph.]

Figure 21: Results of scalability experiments: algorithm run time is linear with size of data

Figure 21 shows that the run-time complexity of our matching-based data querying solution is linear in the size of the data. Information prioritization (Figure 15 shows results for 10% of data nodes actively processed at every iteration) provides an order-of-magnitude improvement in the run time, computed as the sum of belief message updates from Eq. (5)-(7) in the centralized case (Figure 21a), at the expense of some reduction in retrieval accuracy (measured by F-score; Figure 21b).


[Figure 22 panels: (a) Improvement achieved by information prioritization, centralized algorithm: effect of information prioritization on the centralized implementation (# data nodes = 100K). (b) Improvement achieved by information prioritization, distributed algorithm (single core): effect of information prioritization on the distributed implementation run on a single machine with minimal communication delays. (c) Information prioritization also decreases the number of communicated messages. (d) Comparison of F-scores of the baseline vs. the distributed graph mining model with respect to information prioritization: accuracy of the centralized and distributed implementations is similar for different levels of information prioritization.]

Figure 22: Improvements in scalability can be achieved by using the information prioritization model, which reduces the amount of data actively processed at each iteration

The performance of the algorithm can be tuned by trading off accuracy against computational complexity. An order-of-magnitude reduction in update time can be achieved if the number of "relevant" nodes processed at the computational units is 10% of the total number of data nodes (Figure 22a), while further reduction in the number of actively processed data nodes provides only a marginal reduction in computation time. On the same total computational resources, the decentralized implementation had a slightly better run time than the centralized implementation (Figure 22b), which was due to minimal communication delays. However, information prioritization drastically reduces the number of messages that must be communicated between the units (Figure 22c); hence the distributed implementation promises a linear-scale improvement in terms of the number of compute units. The accuracy of the distributed implementation was comparable to that of the centralized implementation (Figure 22d) when the attributes did not contain any noise. This is essentially a case of attributed graph isomorphism and is the type of solution provided by standard querying engines such as SPARQL.
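The idea of processing only a fraction of "relevant" nodes at each iteration can be sketched as follows. The ranking criterion is an assumption: this section does not restate how relevance is computed, so the sketch simply ranks nodes by a single current belief score, and the function name and node ids are hypothetical.

```python
import heapq


def select_active_nodes(beliefs, fraction=0.10):
    """Return the ids of the highest-belief nodes (at least one node).

    beliefs maps a data-node id to its current belief score; only the
    returned nodes would be updated in the next iteration.
    """
    k = max(1, round(len(beliefs) * fraction))
    return set(heapq.nlargest(k, beliefs, key=beliefs.get))


# 100 hypothetical data nodes with distinct belief scores 0.00 .. 0.99.
beliefs = {f"n{i}": i / 100 for i in range(100)}
active = select_active_nodes(beliefs, fraction=0.10)
```

With a 10% fraction, only the ten highest-scoring nodes remain active, which is where both the saving in update time and the risk of discarding a true match come from.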


4.1.2  Solution  sensitivity  to  noise 


Figure 23 shows how the accuracy of our algorithm is affected by noise in the data. While the baseline algorithm is robust to noise, the performance of the algorithms with scalability configurations ("information prioritization" or "communication compression") degrades as the range of error in attribute values increases. The degradation is more pronounced in the distributed case, because the local decisions made by computational units about which data nodes are more relevant and which forward/backward messages must be communicated become less accurate globally as data noise increases.


(a)  Results  of  retrieval  with  data  that  contains  1  true  match  to  a  query 


(b)  Results  of  retrieval  with  data  that  contains  10  true  matches  to  a  query 


Figure  23:  Results  of  noise  sensitivity  experiments 


4.2  Experiments  with  LUBM  data 

LUBM is a synthetic data generator that produces graph data representing the entities and interactions of universities: staff, students, classes, and publications (Figure 24). The LUBM graph is fully connected and contains knowledge fragments that describe the relations of and interactions between entities (Figure 25).


Element      Types   Count
Entities     4       17925
Relations    9       64391

Figure 24: Example statistics of a single-university LUBM data subset

Figure 25: Knowledge fragments are subgraphs in LUBM data graph

We used DSPACE to analyze the frequency of knowledge patterns in LUBM data. We extracted the knowledge subgraphs by selecting key nodes and the 1-2 hop neighborhood around them, treated these fragments as queries, and then tried to find other knowledge fragments that matched the query. Some of the knowledge patterns occurred frequently in the data, while others did not (Figure 26). Several examples of these queries and the matches are depicted in Figures 27-29.
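Extracting a knowledge fragment as the neighborhood of a key node can be sketched with a breadth-first search. This is an illustration only: the adjacency-list representation, the function name, and the toy entity names are hypothetical and are not taken from the LUBM data.

```python
from collections import deque


def k_hop_fragment(adjacency, key_node, hops):
    """Return (nodes, edges) of the subgraph within `hops` of key_node."""
    depth = {key_node: 0}
    frontier = deque([key_node])
    while frontier:
        node = frontier.popleft()
        if depth[node] == hops:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency.get(node, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                frontier.append(neighbor)
    nodes = set(depth)
    # Keep every edge whose two endpoints both fall inside the fragment.
    edges = {(a, b) for a in nodes for b in adjacency.get(a, ()) if b in nodes}
    return nodes, edges


# Toy university graph, stored with both directions of each relation.
adjacency = {
    "student1": ["prof1", "course1"],
    "prof1": ["student1", "dept1", "paper1"],
    "course1": ["student1"],
    "dept1": ["prof1", "univ1"],
    "paper1": ["prof1"],
    "univ1": ["dept1"],
}
nodes_1hop, edges_1hop = k_hop_fragment(adjacency, "student1", hops=1)
nodes_2hop, edges_2hop = k_hop_fragment(adjacency, "student1", hops=2)
```

The 1-hop fragment around student1 contains the student, the professor, and the course; the 2-hop fragment adds the department and the paper but still excludes the university.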
Figure 26: Knowledge fragments are subgraphs in LUBM data graph

Figure 27: Examples of queries and matches in LUBM data - Publication pattern

Figure 28: Examples of queries and matches in LUBM data - Student-adviser pattern

Figure 29: Examples of queries and matches in LUBM data - Student attendance pattern

5. CONCLUSION


In this project we developed the DSPACE system for performing distributed queries against large-scale relational (graph) datasets that cannot employ Cloud distributed file sharing and processing systems. The solution to this problem is an autonomous peer-to-peer network of computing units that can perform data exploration tasks, such as retrieving information based on complex ambiguous queries, or discovering frequent graph patterns in the data. We implemented a data exploitation model as distributed collaborative inexact graph matching, which defines a collaborative data processing policy of local computations at, and communication between, processing units. We showed that the baseline distributed implementation has the same accuracy as the centralized solution. Both the centralized and the distributed algorithms can achieve improvements in scalability using priority-based filtering and message compression techniques, at the expense of some degradation in retrieval accuracy.

Our current research is focused on improvements to the distributed collaborative graph analysis algorithms, including distributed graph pattern learning, adaptive prioritization and filtering, and graph indexing, to provide further scalability and accuracy improvements.

LIST OF ACRONYMS


ACRONYM   DESCRIPTION
AFRL      Air Force Research Laboratory
SW        Software
JMS       Java Message Service
API       Application Programming Interface
BSP       Bulk Synchronous Parallel
ISR       Intelligence, Surveillance and Reconnaissance
BAA       Broad Agency Announcement
SQL       Structured Query Language
SPARQL    SPARQL Protocol and RDF Query Language
RDF       Resource Description Framework
BP        Belief Propagation
SLBP      Smoothed Loopy Belief Propagation
LIDAR     Light Detection And Ranging
GMTI      Ground Moving Target Indicator
EM        Expectation Maximization
DSPACE    Distributed Sensing and Processing Adaptive Collaborative Environment
LUBM      Lehigh University Benchmark